[SPARK-8288][SQL] ScalaReflection can use companion object constructor #23062
Conversation
Test build #98940 has finished for PR 23062 at commit
srowen left a comment:

Probably OK. My concern is whether this still works in Scala 2.11 and 2.12, and there are some style issues.
Review threads on:

- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
- ...talyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala
drewrobb: Thanks much for the prompt feedback. PTAL at the changes. I had to modify the ScroogeLikeExample class to make the additional DatasetSuite test work -- my prior example had removed methods from the scrooge code that were actually necessary. I've kept these as separate commits for review, but can squash later. Re: Scala 2.12, are there any specific things you're worried about that I could look at? I'm not super familiar with the changes in 2.12, and I think 2.11 should be fine, as this was developed using 2.11. And I assume the QA build is testing on 2.12 also? Also, this might not be the place to ask, but it is rather ungreppable: have you encountered compilation-generated java files in ./org/ (that directory is also not in .gitignore!) breaking compilation in sbt? In my development environment I frequently have to delete that entire directory for compilation to work again.
srowen left a comment:

I don't have a specific concern about Scala 2.11 or 2.12; I just know that this is the kind of thing that can break across versions. The PR builder will use 2.12, and we do have other tests for 2.11.
Further review threads on:

- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
Test build #99033 has finished for PR 23062 at commit
More review threads on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala.
The thread below concerns this excerpt from the ScroogeLikeExample test class:

```scala
override def canEqual(other: Any): Boolean = other.isInstanceOf[ScroogeLikeExample]

private def _equals(x: ScroogeLikeExample, y: ScroogeLikeExample): Boolean =
  x.productArity == y.productArity &&
```
srowen: Sure, but you've already established that it's a ScroogeLikeExample here. Why must it be Product1 just to check whether it's also Product1? It seems like it's not adding anything. In fact, why not just compare the one element that this trait knows about? To the extent it can implement equals() meaningfully, that's all it is doing already.
drewrobb: My previous answer was not complete. Product1 is also necessary so that the implicit Encoders.product[T <: Product : TypeTag] will work with this class. If it is omitted, the DatasetSuite test will not compile:

```
[error] /home/drew/spark/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:1577: value toDS is not a member of Seq[org.apache.spark.sql.catalyst.ScroogeLikeExample]
[error] val ds = data.toDS
```

I could add some new encoder, but I think that might be worse, as the goal of this PR is for Scrooge classes to work with the provided implicit encoders.
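For context, a small sketch of why the Product bound matters here (Encoders.product is real Spark API; the helper name below is only illustrative):

```scala
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.ScroogeLikeExample

// Mirrors the bound on SQLImplicits' implicit product encoder: an Encoder is
// only derivable this way for T <: Product, which is why the trait must
// extend Product1.
def productEncoder[T <: Product : TypeTag]: Encoder[T] = Encoders.product[T]

// Compiles because ScroogeLikeExample extends Product1[Int]; without that
// supertype, implicit resolution fails exactly as in the toDS error above.
val enc: Encoder[ScroogeLikeExample] = productEncoder[ScroogeLikeExample]
```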
srowen: Just use SparkSession.createDataset?
srowen: Hm, actually that probably won't work any more or less. OK, it's because there is an Encoder for Product. You can still simplify the equals() and so on, I think, but that looks easier than a new Encoder. Or is it sufficient to test a Seq of a concrete subtype of ScroogeLikeExample?
drewrobb: I'm worried about changing the tests to use a concrete subtype, because the reflection calls might behave differently in that case, either now or later on. I simplified a little more. canEqual is necessary to implement Product; equals is necessary or the tests will not pass (they would check object pointer equality); and hashCode is needed for scalastyle to pass since equals is present.
srowen: That's OK, leave in Product if it's actually testing the case you have in mind. Yes, I know equals() is needed. The new implementation looks good.
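For reference, a minimal sketch of a fixture along the lines settled on above (modeled on the ScroogeLikeExample discussed in this thread; the PR's actual fixture may differ in detail):

```scala
// A trait constructible only through its companion apply, like scrooge codegen output.
trait ScroogeLikeExample extends Product1[Int] with Serializable {
  def x: Int
  override def _1: Int = x

  // canEqual is required to implement Product; equals avoids reference
  // equality so round-trip tests can compare decoded values; hashCode
  // accompanies equals to satisfy scalastyle.
  override def canEqual(other: Any): Boolean = other.isInstanceOf[ScroogeLikeExample]
  override def equals(other: Any): Boolean =
    canEqual(other) && this.x == other.asInstanceOf[ScroogeLikeExample].x
  override def hashCode: Int = x
}

object ScroogeLikeExample {
  // The companion "constructor" that this PR teaches ScalaReflection to find.
  def apply(x: Int): ScroogeLikeExample = new Immutable(x)

  class Immutable(val x: Int) extends ScroogeLikeExample
}
```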
Test build #99039 has finished for PR 23062 at commit
Test build #99029 has finished for PR 23062 at commit
Test build #99031 has finished for PR 23062 at commit
Test build #99043 has finished for PR 23062 at commit
Test build #99044 has finished for PR 23062 at commit
The change to `findConstructor` under discussion (excerpt; the match replaces what was previously just the `Option(...)` expression):

```scala
def findConstructor[T](cls: Class[T], paramTypes: Seq[Class[_]]): Option[Seq[AnyRef] => T] = {
  Option(ConstructorUtils.getMatchingAccessibleConstructor(cls, paramTypes: _*)) match {
    case Some(c) => Some(x => c.newInstance(x: _*).asInstanceOf[T])
```
srowen: I think this cast isn't needed because Class[T].newInstance returns T already, but it's fine to leave.
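For illustration, a self-contained sketch of the companion-apply lookup that the non-codegen path performs via scala reflection (the helper name and the assumption of a single non-overloaded `apply` are mine, not the PR's):

```scala
import scala.reflect.runtime.{universe => ru}

// Find a companion object's apply method reflectively and invoke it, roughly
// what NewInstance must do when the class has no real constructor.
// Assumes the companion defines exactly one, non-overloaded apply.
def companionApply[T: ru.TypeTag](args: AnyRef*): T = {
  val mirror = ru.runtimeMirror(getClass.getClassLoader)
  val companion = ru.typeOf[T].typeSymbol.companion.asModule
  val instance = mirror.reflectModule(companion).instance
  val applySym = companion.info.member(ru.TermName("apply")).asMethod
  mirror.reflect(instance).reflectMethod(applySym)(args: _*).asInstanceOf[T]
}

// e.g. companionApply[ScroogeLikeExample](Integer.valueOf(1)) == ScroogeLikeExample(1)
```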
Merged to master
## What changes were proposed in this pull request?

This change fixes a particular scenario where default Spark SQL can't encode (thrift) types that are generated by twitter scrooge. These types are a trait that extends `scala.ProductX` with a constructor defined only in a companion object, rather than an actual case class. The actual case class used is a child class, but that type is almost never referred to in code. The type has no corresponding constructor symbol and causes an exception. For all other purposes, these classes act just like case classes, so it is unfortunate that Spark SQL can't serialize them as nicely as it can actual case classes. For a full example of a scrooge codegen class, see https://gist.github.com/anonymous/ba13d4b612396ca72725eaa989900314.

This change catches the case where the type has no constructor but does have an `apply` method on the type's companion object. This allows thrift types to be serialized/deserialized with implicit encoders the same way as normal case classes. This fix had to be done in three places where the constructor is assumed to be an actual constructor:

1) In serializing, determining the schema for the dataframe relies on inspecting its constructor (`ScalaReflection.constructParams`). Here we fall back to using the companion constructor arguments.

2) In deserializing or evaluating, in the java codegen (`NewInstance.doGenCode`), the type couldn't be constructed with the `new` keyword. If there is no constructor, we change the constructor call to try the companion constructor.

3) In deserializing or evaluating, without codegen, the constructor is directly invoked (`NewInstance.constructor`). This was fixed with scala reflection to get the actual companion apply method.

The return type of `findConstructor` was changed because the companion apply method constructor can't be represented as a `java.lang.reflect.Constructor`.

There might be situations in which this approach would also fail in a new way, but it does at a minimum work for the specific scrooge example and will not impact cases that were already succeeding prior to this change.

Note: this fix does not enable using scrooge thrift enums; additional work is necessary for that. With this patch, it seems like you could patch `com.twitter.scrooge.ThriftEnum` to extend `_root_.scala.Product1[Int]` with `def _1 = value` to get Spark's implicit encoders to handle enums, but I've yet to use this method myself.

Note: I previously opened a PR for this issue, but was only able to fix case 1) there: apache#18766

## How was this patch tested?

I've fixed all 3 cases and added two tests that use a case class similar to a scrooge-generated one. The test in ScalaReflectionSuite checks 1), and the additional assertions in ObjectExpressionsSuite check 2) and 3).

Closes apache#23062 from drewrobb/SPARK-8288.

Authored-by: Drew Robb <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
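As a usage illustration of what this PR enables, a minimal sketch (assuming the ScroogeLikeExample fixture sketched earlier is on the classpath; the local-mode setup is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.ScroogeLikeExample

object CompanionConstructorDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
    import spark.implicits._

    // ScroogeLikeExample has no public constructor, only a companion apply;
    // with this change the implicit product encoder round-trips it anyway.
    val ds = Seq(ScroogeLikeExample(1), ScroogeLikeExample(2)).toDS()
    assert(ds.collect().toSeq == Seq(ScroogeLikeExample(1), ScroogeLikeExample(2)))

    spark.stop()
  }
}
```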
### What changes were proposed in this pull request?

Fixes encoding of classes that use companion object constructors in the interpreted path. Without this change, the spec that is added in this change would fail with:

```
...
Cause: java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: Couldn't find a valid constructor on interface org.apache.spark.sql.catalyst.ScroogeLikeExample
newInstance(interface org.apache.spark.sql.catalyst.ScroogeLikeExample)
at org.apache.spark.sql.errors.QueryExecutionErrors$.expressionDecodingError(QueryExecutionErrors.scala:1199)
...
```

As far as I can tell, this bug has existed since the initial implementation in SPARK-8288 (#23062). The existing spec that tested this part of the code incorrectly provided an outerPointer, which hid the bug from that test.

### Why are the changes needed?

Fixes a bug; the new spec in ExpressionEncoderSuite shows that this is in fact a bug.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a bug.

### How was this patch tested?

New and existing specs in ExpressionEncoderSuite and ObjectExpressionsSuite.

Closes #37837 from eejbyfeldt/spark-40385.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
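A hedged sketch of the round-trip such a spec exercises (`ExpressionEncoder`, `resolveAndBind`, `createSerializer`, and `createDeserializer` are Spark 3.x APIs; the suite's actual spec may differ):

```scala
import org.apache.spark.sql.catalyst.ScroogeLikeExample
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Serialize and deserialize through an ExpressionEncoder. Before the fix,
// the interpreted deserialization path (no codegen) could not locate the
// companion constructor and failed with the decoding error quoted above.
val enc = ExpressionEncoder[ScroogeLikeExample]().resolveAndBind()
val internalRow = enc.createSerializer()(ScroogeLikeExample(1))
assert(enc.createDeserializer()(internalRow) == ScroogeLikeExample(1))
```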